Unicode

class U_COMMON_API Unicode

The Unicode class allows you to query the properties associated with individual Unicode character values

Public Classes
enum EUnicodeGeneralTypes: Public data for enumerated Unicode general category types
enum EDirectionProperty: This specifies the language directional property of a character set
enum ECellWidths: Values returned by the getCellWidth() function

Public Fields
static const UChar MIN_VALUE: The minimum value a UChar can have
static const UChar MAX_VALUE: The maximum value a UChar can have

Public Methods
static bool_t isLowerCase(UChar ch): Determines whether the specified UChar is a lowercase character according to Unicode 2.1.2
static bool_t isUpperCase(UChar ch): Determines whether the specified character is an uppercase character according to Unicode 2.1.2
static bool_t isTitleCase(UChar ch): Determines whether the specified character is a titlecase character according to Unicode 2.1.2
static bool_t isDigit(UChar ch): Determines whether the specified character is a digit according to Unicode 2.1.2
static bool_t isDefined(UChar ch): Determines whether the specified numeric value is actually a defined character according to Unicode 2.1.2
static bool_t isControl(UChar ch): Determines whether the specified character is a control character according to Unicode 2.1.2
static bool_t isPrintable(UChar ch): Determines whether the specified character is a printable character according to Unicode 2.1.2
static bool_t isBaseForm(UChar ch): Determines whether the specified character is of the base form according to Unicode 2.1.2
static bool_t isLetter(UChar ch): Determines whether the specified character is a letter according to Unicode 2.1.2
static bool_t isJavaIdentifierStart(UChar ch): A convenience method for determining if a Unicode character is allowed as the first character in a Java identifier
static bool_t isJavaIdentifierPart(UChar ch): A convenience method for determining if a Unicode character may be part of a Java identifier other than the starting character
static bool_t isUnicodeIdentifierStart(UChar ch): A convenience method for determining if a Unicode character is allowed to start a Unicode identifier
static bool_t isUnicodeIdentifierPart(UChar ch): A convenience method for determining if a Unicode character may be part of a Unicode identifier other than the starting character
static bool_t isIdentifierIgnorable(UChar ch): A convenience method for determining if a Unicode character should be regarded as an ignorable character in a Java identifier or a Unicode identifier
static UChar toLowerCase(UChar ch): The given character is mapped to its lowercase equivalent according to Unicode 2.1.2
static UChar toUpperCase(UChar ch): The given character is mapped to its uppercase equivalent according to Unicode 2.1.2
static UChar toTitleCase(UChar ch): The given character is mapped to its titlecase equivalent according to Unicode 2.1.2
static bool_t isSpaceChar(UChar ch): Determines if the specified character is a Unicode space character according to Unicode 2.1.2
static int8_t getType(UChar ch): Returns a value indicating a character category according to Unicode 2.1.2
static EDirectionProperty characterDirection(UChar ch): Returns the linguistic direction property of a character
static EUnicodeScript getScript(UChar ch): Returns the script associated with a character
static uint16_t getCellWidth(UChar ch): Returns a value indicating the display-cell width of the character when used in Asian text, according to the Unicode standard (see p. 6-130 of The Unicode Standard, Version 2.0)
static int32_t digitValue(UChar ch): Retrieves the decimal numeric value of a digit character
static const char* getVersion(void): Retrieves the Unicode Standard Version number that is used

Documentation

The Unicode class allows you to query the properties associated with individual Unicode character values.
The Unicode character information, provided implicitly by the Unicode character encoding standard, includes information about the script (for example, symbols or control characters) to which the character belongs, as well as semantic information such as whether a character is a digit or uppercase, lowercase, or uncased.
@subclassing Do not subclass.

static const UChar MIN_VALUE

The minimum value a UChar can have, 0x0000.

static const UChar MAX_VALUE

The maximum value a UChar can have, 0xffff.

enum EUnicodeGeneralTypes

Public data for enumerated Unicode general category types

enum EDirectionProperty

This specifies the language directional property of a character set

enum ECellWidths

Values returned by the getCellWidth() function

See Also:: getCellWidth

static bool_t isLowerCase(UChar ch)

Determines whether the specified UChar is a lowercase character according to Unicode 2.1.2.

Returns:: true if the character is lowercase; false otherwise.
Parameters:: ch - the character to be tested
See Also:: isUpperCase
isTitleCase
toLowerCase

static bool_t isUpperCase(UChar ch)

Determines whether the specified character is an uppercase character according to Unicode 2.1.2.

Returns:: true if the character is uppercase; false otherwise.
Parameters:: ch - the character to be tested
See Also:: isLowerCase
isTitleCase
toUpperCase

static bool_t isTitleCase(UChar ch)

Determines whether the specified character is a titlecase character according to Unicode 2.1.2.

Returns:: true if the character is titlecase; false otherwise.
Parameters:: ch - the character to be tested
See Also:: isUpperCase
isLowerCase
toTitleCase

static bool_t isDigit(UChar ch)

Determines whether the specified character is a digit according to Unicode 2.1.2.

Returns:: true if the character is a digit; false otherwise.
Parameters:: ch - the character to be tested

static bool_t isDefined(UChar ch)

Determines whether the specified numeric value is actually a defined character according to Unicode 2.1.2.

Returns:: true if the character has a defined Unicode meaning; false otherwise.
Parameters:: ch - the character to be tested
See Also:: isDigit
isLetter
isLetterOrDigit
isUpperCase
isLowerCase
isTitleCase

static bool_t isControl(UChar ch)

Determines whether the specified character is a control character according to Unicode 2.1.2.

Returns:: true if the Unicode character is a control character; false otherwise.
Parameters:: ch - the character to be tested
See Also:: isPrintable

static bool_t isPrintable(UChar ch)

Determines whether the specified character is a printable character according to Unicode 2.1.2.

Returns:: true if the Unicode character is a printable character; false otherwise.
Parameters:: ch - the character to be tested
See Also:: isControl

static bool_t isBaseForm(UChar ch)

Determines whether the specified character is of the base form according to Unicode 2.1.2.

Returns:: true if the Unicode character is of the base form; false otherwise.
Parameters:: ch - the character to be tested
See Also:: isLetter
isDigit

static bool_t isLetter(UChar ch)

Determines whether the specified character is a letter according to Unicode 2.1.2.

Returns:: true if the character is a letter; false otherwise.
Parameters:: ch - the character to be tested
See Also:: isDigit
isLetterOrDigit
isUpperCase
isLowerCase
isTitleCase

static bool_t isJavaIdentifierStart(UChar ch)

A convenience method for determining if a Unicode character is allowed as the first character in a Java identifier.

A character may start a Java identifier if and only if it is one of the following:

a letter
a currency symbol (such as "$")
a connecting punctuation symbol (such as "_").

Returns:: TRUE if the character may start a Java identifier; FALSE otherwise.
Parameters:: ch - the Unicode character.
See Also:: isJavaIdentifierPart
isLetter
isUnicodeIdentifierStart
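
The rule above can be sketched as a check on a character's general category. This is only an illustration, not the library's implementation: the Category enum is a hypothetical stand-in whose names follow the getType() category list, and it assumes that "a letter" means any of the five letter categories.

```cpp
// Hypothetical stand-in for the general categories reported by getType().
// Names follow the getType() category list; the numeric values are
// arbitrary and are NOT the library's values.
enum class Category {
    UPPERCASE_LETTER, LOWERCASE_LETTER, TITLECASE_LETTER, MODIFIER_LETTER,
    OTHER_LETTER, DECIMAL_DIGIT_NUMBER, CONNECTOR_PUNCTUATION,
    OTHER_PUNCTUATION, CURRENCY_SYMBOL
};

// Assumption: "a letter" covers the five letter categories.
constexpr bool isLetterCategory(Category c) {
    return c == Category::UPPERCASE_LETTER
        || c == Category::LOWERCASE_LETTER
        || c == Category::TITLECASE_LETTER
        || c == Category::MODIFIER_LETTER
        || c == Category::OTHER_LETTER;
}

// A character may start a Java identifier if it is a letter, a currency
// symbol (such as "$"), or a connecting punctuation symbol (such as "_").
constexpr bool mayStartJavaIdentifier(Category c) {
    return isLetterCategory(c)
        || c == Category::CURRENCY_SYMBOL
        || c == Category::CONNECTOR_PUNCTUATION;
}
```

With this sketch, a digit category is rejected as a starting character while the currency and connector categories are accepted.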

static bool_t isJavaIdentifierPart(UChar ch)

A convenience method for determining if a Unicode character may be part of a Java identifier other than the starting character.

A character may be part of a Java identifier if and only if it is one of the following:

a letter
a currency symbol (such as "$")
a connecting punctuation character (such as "_").
a digit
a numeric letter (such as a Roman numeral character)
a combining mark
a non-spacing mark
an ignorable control character

Returns:: TRUE if the character may be part of a Java identifier; FALSE otherwise.
Parameters:: ch - the Unicode character.
See Also:: isIdentifierIgnorable
isJavaIdentifierStart
isLetter
isDigit
isUnicodeIdentifierPart

static bool_t isUnicodeIdentifierStart(UChar ch)

A convenience method for determining if a Unicode character is allowed to start a Unicode identifier. A character may start a Unicode identifier if and only if it is a letter.

Returns:: TRUE if the character may start a Unicode identifier; FALSE otherwise.
Parameters:: ch - the Unicode character.
See Also:: isJavaIdentifierStart
isLetter
isUnicodeIdentifierPart

static bool_t isUnicodeIdentifierPart(UChar ch)

A convenience method for determining if a Unicode character may be part of a Unicode identifier other than the starting character.

A character may be part of a Unicode identifier if and only if it is one of the following:

a letter
a connecting punctuation character (such as "_").
a digit
a numeric letter (such as a Roman numeral character)
a combining mark
a non-spacing mark
an ignorable control character

Returns:: TRUE if the character may be part of a Unicode identifier; FALSE otherwise.
Parameters:: ch - the Unicode character.
See Also:: isIdentifierIgnorable
isJavaIdentifierPart
isLetterOrDigit
isUnicodeIdentifierStart

static bool_t isIdentifierIgnorable(UChar ch)

A convenience method for determining if a Unicode character should be regarded as an ignorable character in a Java identifier or a Unicode identifier.

The following Unicode characters are ignorable in a Java identifier or a Unicode identifier:

0x0000 through 0x0008, 0x000E through 0x001B, and 0x007F through 0x009F: ISO control characters that are not whitespace
0x200C through 0x200F: join controls
0x200A through 0x200E: bidirectional controls
0x206A through 0x206F: format controls
0xFEFF: zero-width no-break space

Returns:: TRUE if the character is ignorable in a Java identifier or a Unicode identifier; FALSE otherwise.
Parameters:: ch - the Unicode character.
See Also:: isJavaIdentifierPart
isUnicodeIdentifierPart
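
The ranges listed above amount to a simple range test. The following is a minimal sketch that transcribes them literally; isIgnorableSketch and inRange are hypothetical names, and the library's actual implementation may differ.

```cpp
// True when lo <= ch <= hi.
constexpr bool inRange(char16_t ch, char16_t lo, char16_t hi) {
    return ch >= lo && ch <= hi;
}

// Hypothetical helper: a literal transcription of the documented ranges.
// It is NOT the library's isIdentifierIgnorable().
constexpr bool isIgnorableSketch(char16_t ch) {
    return ch <= 0x0008                    // ISO control characters that
        || inRange(ch, 0x000E, 0x001B)     //   are not whitespace
        || inRange(ch, 0x007F, 0x009F)
        || inRange(ch, 0x200C, 0x200F)     // join controls
        || inRange(ch, 0x200A, 0x200E)     // bidirectional controls (range as documented)
        || inRange(ch, 0x206A, 0x206F)     // format controls
        || ch == 0xFEFF;                   // zero-width no-break space
}
```

Note that the whitespace controls 0x0009 through 0x000D fall in the gap between the first two ranges, so the sketch does not treat them as ignorable.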

static UChar toLowerCase(UChar ch)

The given character is mapped to its lowercase equivalent according to Unicode 2.1.2; if the character has no lowercase equivalent, the character itself is returned.

A character has a lowercase equivalent if and only if a lowercase mapping is specified for the character in the Unicode 2.0 attribute table.

Unicode::toLowerCase() handles only general, language-independent case conversion. For language-specific case conversion, such as dotless i and dotted I in Turkish or final sigma in Greek, use UnicodeString::toLower().

Returns:: the lowercase equivalent of the character, if any; otherwise the character itself.
Parameters:: ch - the character to be converted
See Also:: toLower
isLowerCase
isUpperCase
toUpperCase
toTitleCase

static UChar toUpperCase(UChar ch)

The given character is mapped to its uppercase equivalent according to Unicode 2.1.2; if the character has no uppercase equivalent, the character itself is returned.

Unicode::toUpperCase() handles only general, language-independent case conversion. For language-specific case conversion, such as dotless i and dotted I in Turkish or sharp S (ess-zett) in German, use UnicodeString::toUpper().

Returns:: the uppercase equivalent of the character, if any; otherwise the character itself.
Parameters:: ch - the character to be converted
See Also:: toUpper
isUpperCase
isLowerCase
toLowerCase
toTitleCase

static UChar toTitleCase(UChar ch)

The given character is mapped to its titlecase equivalent according to Unicode 2.1.2. There are only four Unicode characters that are truly titlecase forms that are distinct from uppercase forms. As a rule, if a character has no true titlecase equivalent, its uppercase equivalent is returned.

A character has a titlecase equivalent if and only if a titlecase mapping is specified for the character in the Unicode 2.1.2 data.

Returns:: the titlecase equivalent of the character, if any; otherwise the character itself.
Parameters:: ch - the character to be converted
See Also:: isTitleCase
toUpperCase
toLowerCase
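
The fallback rule described above (a true titlecase form where one exists, otherwise the uppercase form) can be sketched as follows. The function names are hypothetical, the uppercase step is an ASCII-only stand-in rather than the library's toUpperCase(), and the four titlecase forms are taken to be the Latin digraphs 0x01C5, 0x01C8, 0x01CB, and 0x01F2.

```cpp
// ASCII-only stand-in for an uppercase mapping (NOT the library's toUpperCase).
constexpr char16_t asciiToUpper(char16_t ch) {
    return (ch >= u'a' && ch <= u'z')
        ? static_cast<char16_t>(ch - (u'a' - u'A'))
        : ch;
}

// Hypothetical sketch: each digraph comes as an uppercase, titlecase, and
// lowercase triple; all three map to the middle (titlecase) code point.
// Every other character falls back to its uppercase equivalent.
constexpr char16_t toTitleSketch(char16_t ch) {
    return (ch >= 0x01C4 && ch <= 0x01C6) ? char16_t(0x01C5)   // DZ with caron
         : (ch >= 0x01C7 && ch <= 0x01C9) ? char16_t(0x01C8)   // LJ
         : (ch >= 0x01CA && ch <= 0x01CC) ? char16_t(0x01CB)   // NJ
         : (ch >= 0x01F1 && ch <= 0x01F3) ? char16_t(0x01F2)   // DZ
         : asciiToUpper(ch);
}
```

A character with neither mapping, such as a digit, comes back unchanged.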

static bool_t isSpaceChar(UChar ch)

Determines if the specified character is a Unicode space character according to Unicode 2.1.2.

Returns:: true if the character is a space character; false otherwise.
Parameters:: ch - the character to be tested

static int8_t getType(UChar ch)

Returns a value indicating a character category according to Unicode 2.1.2.

Returns:: a value of type int8_t, the character category.
Parameters:: ch - the character to be tested
See Also:: UNASSIGNED
UPPERCASE_LETTER
LOWERCASE_LETTER
TITLECASE_LETTER
MODIFIER_LETTER
OTHER_LETTER
NON_SPACING_MARK
ENCLOSING_MARK
COMBINING_SPACING_MARK
DECIMAL_DIGIT_NUMBER
OTHER_NUMBER
SPACE_SEPARATOR
LINE_SEPARATOR
PARAGRAPH_SEPARATOR
CONTROL
PRIVATE_USE
SURROGATE
DASH_PUNCTUATION
OPEN_PUNCTUATION
CLOSE_PUNCTUATION
CONNECTOR_PUNCTUATION
OTHER_PUNCTUATION
LETTER_NUMBER
MATH_SYMBOL
CURRENCY_SYMBOL
MODIFIER_SYMBOL
OTHER_SYMBOL

static EDirectionProperty characterDirection(UChar ch)

Returns the linguistic direction property of a character. For example, 0x0041 (letter A) has the LEFT_TO_RIGHT directional property.

See Also:: EDirectionProperty

static EUnicodeScript getScript(UChar ch)

Returns the script associated with a character

See Also:: EUnicodeScript

static uint16_t getCellWidth(UChar ch)

Returns a value indicating the display-cell width of the character when used in Asian text, according to the Unicode standard (see p. 6-130 of The Unicode Standard, Version 2.0). The results for various characters are as follows:

ZERO_WIDTH: Characters which are considered to take up no display-cell space:

control characters
format characters
line and paragraph separators
non-spacing marks
combining Hangul jungseong
combining Hangul jongseong
unassigned Unicode values

HALF_WIDTH: Characters which take up half a cell in standard Asian text:

all characters in the General Scripts Area except combining Hangul choseong and the characters called out specifically above as ZERO_WIDTH
alphabetic and Arabic presentation forms
halfwidth CJK punctuation
halfwidth Katakana
halfwidth Hangul Jamo
halfwidth forms, arrows, and shapes

FULL_WIDTH: Characters which take up a full cell in standard Asian text:

combining Hangul choseong
all characters in the CJK Phonetics and Symbols Area
all characters in the CJK Ideographs Area
all characters in the Hangul Syllables Area
CJK compatibility ideographs
CJK compatibility forms
small form variants
fullwidth ASCII
fullwidth punctuation and currency signs

NEUTRAL: Characters whose cell width is context-dependent:

all characters in the Symbols Area, except those specifically called out above
all characters in the Surrogates Area
all characters in the Private Use Area

For Korean text, this algorithm should work with properly normalized text. Precomposed Hangul syllables and non-combining jamo are all considered full-width characters. For combining jamo, choseong (initial consonants) are treated as double-width characters, and jungseong (vowels) and jongseong (final consonants) as non-spacing marks. This works correctly in text that uses the precomposed choseong characters instead of two choseong characters in a row, and that uses the choseong filler character at the beginning of syllables that have no initial consonant. The results may be slightly off with Korean text following different conventions.
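
A few of the classifications above can be sketched as range checks. This is a partial illustration only: CellWidth is a hypothetical stand-in for ECellWidths (whose actual values are not given here), NOT_COVERED is an addition for this sketch, and the code point ranges are assumptions drawn from the Unicode block layout rather than from the library.

```cpp
// Hypothetical stand-in for ECellWidths. NOT_COVERED exists only in this
// sketch and marks characters that the handful of rules below do not handle.
enum class CellWidth { ZERO_WIDTH, HALF_WIDTH, FULL_WIDTH, NEUTRAL, NOT_COVERED };

// True when lo <= ch <= hi.
constexpr bool inRange(char16_t ch, char16_t lo, char16_t hi) {
    return ch >= lo && ch <= hi;
}

// Partial sketch of the documented classification (assumed ranges).
constexpr CellWidth cellWidthSketch(char16_t ch) {
    return (ch <= 0x001F || inRange(ch, 0x007F, 0x009F))
                                         ? CellWidth::ZERO_WIDTH  // control characters
         : inRange(ch, 0x1100, 0x115F)   ? CellWidth::FULL_WIDTH  // combining Hangul choseong
         : inRange(ch, 0x1160, 0x11FF)   ? CellWidth::ZERO_WIDTH  // jungseong and jongseong
         : inRange(ch, 0x4E00, 0x9FA5)   ? CellWidth::FULL_WIDTH  // CJK ideographs
         : inRange(ch, 0xAC00, 0xD7A3)   ? CellWidth::FULL_WIDTH  // Hangul syllables
         : inRange(ch, 0xD800, 0xDFFF)   ? CellWidth::NEUTRAL     // Surrogates Area
         : inRange(ch, 0xE000, 0xF8FF)   ? CellWidth::NEUTRAL     // Private Use Area
         : inRange(ch, 0xFF01, 0xFF5E)   ? CellWidth::FULL_WIDTH  // fullwidth ASCII
         : CellWidth::NOT_COVERED;
}
```

Under these assumptions a precomposed syllable such as 0xAC00 is full width, while a combining vowel such as 0x1161 contributes no width of its own, which matches the Korean handling described above.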

static int32_t digitValue(UChar ch)

Retrieves the decimal numeric value of a digit character

Returns:: the numeric value of ch in decimal radix. This method returns -1 if ch is not a valid digit character.
Parameters:: ch - the digit character for which to get the numeric value
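
The contract above (the decimal value for a digit, -1 otherwise) can be sketched for a few digit ranges. digitValueSketch is a hypothetical name; it covers only four ranges of decimal digits as an illustration, whereas the library presumably consults the full Unicode data.

```cpp
#include <cstdint>

// Hypothetical sketch covering four digit ranges only.
// Returns the decimal value of the digit, or -1 if ch is not one of them.
constexpr std::int32_t digitValueSketch(char16_t ch) {
    return (ch >= 0x0030 && ch <= 0x0039) ? ch - 0x0030   // ASCII digits
         : (ch >= 0x0660 && ch <= 0x0669) ? ch - 0x0660   // Arabic-Indic digits
         : (ch >= 0x0966 && ch <= 0x096F) ? ch - 0x0966   // Devanagari digits
         : (ch >= 0xFF10 && ch <= 0xFF19) ? ch - 0xFF10   // fullwidth digits
         : -1;                                            // not a digit in this sketch
}
```

Each range is a contiguous run of ten code points, so subtracting the code point of the range's zero yields the value directly.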

static const char* getVersion(void)

Retrieves the Unicode Standard Version number that is used

Returns:: the Unicode Standard Version Number.

This class has no child classes.


this page has been generated automatically by doc++

(c)opyright by Malte Zöckler, Roland Wunderling
contact: doc++@zib.de
